Add subscription/defer observability attributes and metrics#8858
Add subscription/defer observability attributes and metrics#8858
Conversation
✅ Docs preview readyThe preview is ready to be viewed. View the preview File Changes 0 new, 2 changed, 0 removedBuild ID: adf329da55809814b64cc9df URL: https://www.apollographql.com/docs/deploy-preview/adf329da55809814b64cc9df |
This comment has been minimized.
This comment has been minimized.
a5c8dad to
b9ce87b
Compare
…errors' of https://github.com/apollographql/router into rohan-b99/subscription-observability-heartbeat-payload-errors
carodewig
left a comment
There was a problem hiding this comment.
This is looking great! One comment/question below
carodewig
left a comment
There was a problem hiding this comment.
I've pushed a commit which does this (although the reason is now stored in a enum in the Multipart struct) - let me know if that's better or if you had something else in mind
I think it's easier to understand now, but happy to be overruled!
| response.has_next.unwrap_or(false) || response.subscribed.unwrap_or(false); | ||
|
|
||
| // Check for reload-related termination before errors are moved | ||
| let end_reason = Self::detect_reload_end_reason(&response.errors); |
There was a problem hiding this comment.
Maybe I'm missing something - but is this variable ever used / stored?
There was a problem hiding this comment.
It was used further down but only for subscriptions if the connection was no longer open - I did a small refactor to make it optional as at this point we don't know for sure if it has closed, the response gets moved a bit lower down though so I think it does need to be checked here
|
Hi @theJC, just tagging you here to see if you had any feedback on this implementation - I understand the initial request for more observability on heartbeats/disconnections for subscriptions/ |
pragl
left a comment
There was a problem hiding this comment.
I've left some suggestions on the docs changes to align with our style guide. Please consider them when you get the chance. Thanks!
Co-authored-by: Parker <parker.ragland@apollographql.com>
| /// Used to detect if a heartbeat was the last thing sent before connection closed. | ||
| heartbeat_pending: bool, | ||
| /// The span captured at creation time, used to record attributes on connection close. | ||
| span: Span, |
There was a problem hiding this comment.
recommend naming this creation_span rather than span
| let end_reason = self.end_reason.take().unwrap_or_else(|| match self.mode { | ||
| ProtocolMode::Subscription => { | ||
| // Stream wasn't terminated properly - determine the reason | ||
| let reason = if self.heartbeat_pending { | ||
| // Heartbeat was the last thing sent - likely failed to deliver | ||
| SubscriptionEndReason::HeartbeatDeliveryFailed | ||
| } else { | ||
| // Connection closed after a message was sent | ||
| SubscriptionEndReason::ClientDisconnect | ||
| }; | ||
| EndReason::Subscription(reason) | ||
| } | ||
| ProtocolMode::Defer => { | ||
| // Defer stream wasn't terminated properly - client disconnected | ||
| EndReason::Defer(DeferEndReason::ClientDisconnect) | ||
| } |
There was a problem hiding this comment.
Would it be cleaner if this complex logic be extracted into a function infer_abnormal_termination_reason in Multipart ?
|
Awesome, this looks to be a nice improvement to TRACE observability into subscription/defer activity !!! An additional importantly needed observability is to have instruments/metrics emanated for the important un-happy conditions, ie:
Its important to be able to track how often these occur to track overall health of the subscription activity happening with Router, so anomalies in these trends can be detected and then acted upon. |
…ened_subscriptions` limit has been reached
...urce/routing/observability/router-telemetry-otel/enabling-telemetry/standard-instruments.mdx
Outdated
Show resolved
Hide resolved
Thanks for your feedback on this! Appreciate the input around metrics - I've implemented these along with some other refactors to make termination reasons clearer |
Adds new span attributes and metrics to improve observability of streaming responses.
Span attributes:
apollo.subscription.end_reason: Records the reason a subscription was terminated. Possible values areserver_close,subgraph_error,heartbeat_delivery_failed,client_disconnect,schema_reload, andconfig_reload.apollo.defer.end_reason: Records the reason a deferred query ended. Possible values arecompleted(all deferred chunks were delivered successfully) andclient_disconnect(the client disconnected before all deferred data was delivered).Both attributes are added dynamically to router spans only when relevant (i.e., only on requests that actually use subscriptions or
@defer), rather than being present on every router span.Metrics:
The following counters are emitted when a subscription terminates:
apollo.router.operations.subscriptions.stream_end(attributes:subgraph.service.name): The subgraph gracefully closed the stream.apollo.router.operations.subscriptions.subgraph_error(attributes:subgraph.service.name): The subscription terminated unexpectedly due to a subgraph error (e.g. process killed, network drop).apollo.router.operations.subscriptions.client_disconnect(attributes:apollo.client.name): The client disconnected before the subscription ended.apollo.router.operations.subscriptions.heartbeat_delivery_failed(attributes:apollo.client.name): A heartbeat could not be delivered to the client.apollo.router.operations.subscriptions.schema_reload: The subscription was terminated because the router schema was updated.apollo.router.operations.subscriptions.config_reload: The subscription was terminated because the router configuration was updated.The following counter is emitted when a subscription request is rejected:
apollo.router.operations.subscriptions.rejected.limit: A new subscription request was rejected because the router has reached itsmax_opened_subscriptionslimit.Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. ↩
Configuration is an important part of many changes. Where applicable please try to document configuration examples. ↩
A lot of (if not most) features benefit from built-in observability and
debug-level logs. Please read this guidance on metrics best-practices. ↩Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions. ↩